Naive Bayes - Lesson 1

Features and Labels

Features - an example would be songs. Features of a song might be 'tempo' and 'intensity'; a scatterplot can be made showing where each song falls on those axes.

Labels - labels would say whether a given subject likes the song or not.

In [1]:
from IPython.display import Image
In [ ]:
#example Naive Bayes code

def NBAccuracy(features_train, labels_train, features_test, labels_test):
    from sklearn.naive_bayes import GaussianNB
    from time import time

    #create classifier
    clf = GaussianNB()

    #time the training step, fit the model
    t0 = time()
    clf.fit(features_train, labels_train)
    print "training time: ", round(time() - t0, 3), "s"

    #time and create predictor, run test
    t1 = time()
    pred = clf.predict(features_test)
    print "Predict time: ", round(time() - t1, 3), "s"

    #calculate accuracy
    from sklearn.metrics import accuracy_score
    accuracy = accuracy_score(pred, labels_test)
    print accuracy
    return accuracy

NBAccuracy(features_train, labels_train, features_test, labels_test)

Bayes Rule

Reminder of the quiz question: if the prior probability of cancer is 1%, and the test's sensitivity and specificity are both 90%, what is the probability that someone with a positive cancer test actually has the disease? Answer: 8.33%
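As a quick check (not part of the lesson code), the quiz numbers can be plugged into Bayes' rule directly:

```python
#prior, sensitivity, and specificity from the quiz above
prior = 0.01          #P(cancer)
sensitivity = 0.90    #P(positive test | cancer)
specificity = 0.90    #P(negative test | no cancer)

#total probability of a positive test
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)

#Bayes rule: P(cancer | positive test)
posterior = sensitivity * prior / p_positive
print(round(posterior, 4))  #0.0833
```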

In [3]:
Image('capture.png')
Out[3]:
In [4]:
Image('capture1.png')
Out[4]:

Naive Bayes Pros - Easy to implement; simple and efficient to run

Naive Bayes Cons - Phrases don't work well in Naive Bayes, since word order is ignored ('Chicago Bulls' returning images of the city and the animal)

SVM

Parameters in Machine Learning are arguments passed when you create the classifier, before fitting. These make a HUGE DIFFERENCE in the decision boundary the algorithm arrives at.

Main parameters for SVM: KERNEL, C, and GAMMA. C controls the tradeoff between smooth decision boundary and classifying training points correctly.

From quora

https://www.quora.com/What-are-C-and-gamma-with-regards-to-a-support-vector-machine

C is the cost of misclassification.

A large C gives you low bias and high variance: low bias because you penalize the cost of misclassification a lot. A small C gives you higher bias and lower variance.

Gamma is the parameter of a Gaussian kernel (used to handle non-linear classification). Consider these points:

They are not linearly separable in 2D, so you want to transform them to a higher dimension where they will be linearly separable. Imagine "raising" the green points; then you can separate them from the red points with a plane (hyperplane).

To "raise" the points you use the RBF kernel; gamma controls the shape of the "peaks" where you raise the points. A large gamma gives you a pointed, narrow bump in the higher dimensions, while a small gamma gives you a softer, broader bump.

So a large gamma will give you low bias and high variance, while a small gamma will give you higher bias and low variance.
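A small sketch (not from the lesson) of why gamma controls the bump width: the RBF kernel value exp(-gamma * d^2) measures how much influence a training point has at distance d.

```python
import numpy as np

d = 1.0  #distance of one unit from a training point

broad = np.exp(-0.1 * d ** 2)    #small gamma: influence stays high -> soft, broad bump
narrow = np.exp(-10.0 * d ** 2)  #large gamma: influence dies off fast -> pointed, narrow bump

print(broad)   #~0.905
print(narrow)  #~0.000045
```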

You usually find the best C and Gamma hyper-parameters using Grid-Search.

Overfitting in machine learning: when the algorithm produces a complex decision boundary that fits the training data very closely, when a simpler one would generalize better. C, Gamma, and Kernel all influence overfitting.

In [5]:
#example SVM code
def SVMAccuracy(features_train, labels_train, features_test, labels_test):
    from sklearn.svm import SVC
    from time import time

    #create classifier
    clf = SVC(kernel='rbf', C=10000.0)

    #time the training step, fit the model
    t0 = time()
    
    #reduce training data to 1% of its original size, this reduces the accuracy of the model by 10% 
    #features_train = features_train[:len(features_train)/100]
    #labels_train = labels_train[:len(labels_train)/100]
        
    clf.fit(features_train, labels_train)
    print "training time: ", round(time() - t0, 3), "s"
    
    #time and create predictor, run test
    t1 = time()
    pred = clf.predict(features_test)

    #count how many test events are predicted to be in class 1
    print len(filter(lambda x: x == 1, pred))



    print "Predict time: ", round(time() - t1, 3), "s"

    #calculate accuracy
    from sklearn.metrics import accuracy_score
    accuracy = accuracy_score(pred, labels_test)
    print accuracy
    return accuracy

SVMAccuracy(features_train, labels_train, features_test, labels_test)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-5-ecd9c9a490da> in <module>()
     31     return accuracy
     32 
---> 33 SVMAccuracy(features_train, labels_train, features_test, labels_test)
     34 

NameError: name 'features_train' is not defined

Naive Bayes is great for text--it’s faster and generally gives better performance than an SVM for this particular problem. Of course, there are plenty of other problems where an SVM might work better. Knowing which one to try when you’re tackling a problem for the first time is part of the art and science of machine learning. In addition to picking your algorithm, depending on which one you try, there are parameter tunes to worry about as well, and the possibility of overfitting (especially if you don’t have lots of training data).

Our general suggestion is to try a few different algorithms for each problem. Tuning the parameters can be a lot of work, but just sit tight for now--toward the end of the class we will introduce you to GridSearchCV, a great sklearn tool that can find an optimal parameter tune almost automatically.

Decision Trees

Allow you to ask multiple linear questions, one after the other.

In [5]:
Image('capture2.png')
Out[5]:

Parameters to tune that will affect the decision boundary:

min_samples_split - the minimum number of samples a node must contain before the decision tree will split it into new leaves. Default value is 2

ENTROPY - Controls how a DT decides where to split the data. Definition: measure of IMPURITY in a bunch of examples.

Impurity is how well the data was split by the decision boundary. If points bleed over then the data is less pure than if the DB cleanly separates the data.

Entropy formula: entropy = -sum over classes of p_i * log2(p_i), where p_i is (# of examples in class i / total # of examples). For two classes the value is between 0 (all examples in one class) and 1.0 (an even split).

Information Gain in a Decision Tree:

information gain = entropy(parent) - [weighted average]entropy(children)

Decision tree algorithm maximizes information gain
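The information-gain formula above can be checked with a short sketch (the split below is a made-up example): a parent node with a 2/2 class mix is split into one mixed child and one pure child.

```python
import math

def entropy(counts):
    #entropy in bits of a list of class counts (zero counts are skipped)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

parent_counts = [2, 2]        #entropy 1.0
children = [[2, 1], [0, 1]]   #one mixed child, one pure child

total = sum(parent_counts)
weighted = sum(sum(ch) / total * entropy(ch) for ch in children)
gain = entropy(parent_counts) - weighted
print(round(gain, 4))  #0.3113
```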

In [6]:
#scipy 2 liner to calculate entropy
import scipy.stats
print scipy.stats.entropy([2,1],base=2) 
0.918295834054
In [7]:
Image('capture4.png')
Out[7]:
In [8]:
#scipy 2 liner to calculate entropy
import scipy.stats
print scipy.stats.entropy([2,2],base=2) 
1.0

Bias-Variance Dilemma

A high-bias machine learning algorithm practically ignores the data. A high-variance algorithm is extremely perceptive to the data, but it can only replicate what it has seen before and will react very poorly to data it hasn't seen. You want something in the middle.

MORE DATA IS BETTER THAN A FINE-TUNED ALGORITHM

Enron Dataset Additional Areas of Interest:

List of POI's

Discrete Supervised Learning - The outputs are fixed categories (in-college or not in-college, 1 or 0)

Continuous Supervised Learning - The outputs are not fixed values (e.g. how weight varies as a person gets taller). There is some kind of ordering.

In [7]:
Image('capture10.png')
Out[7]:
In [ ]:
#example linear regression

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(ages_train, net_worths_train)

#additional regression functionality

print 'Net Worth Prediction: ', reg.predict([[27]])[0][0]
print 'slope', reg.coef_[0][0]
print 'intercept', reg.intercept_[0]

print 'STATS ON TEST DATASET'
print 'r-squared score: ', reg.score(ages_test, net_worths_test)

print "STATS ON TRAINING DATASET"
print 'r-squared score: ', reg.score(ages_train, net_worths_train)

Performance Metrics for Regression:

R-Squared: higher value is better, max of 1

The best regression is the one that minimizes the sum of squared errors.

Several algorithms to accomplish this:

Ordinary Least Squares (OLS)- used in sklearn LinearRegression

Gradient Descent

R Squared - Evaluation metric for a regression. Answers the question 'how much of the change in my output (y) is explained by the change in my input (x)?' The value will be between 0 (the line does a poor job of capturing the trend in the data) and 1 (the regression line perfectly captures the trend in the data).
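As a sketch (toy numbers, not from the lesson), r-squared is 1 - SSE / (total sum of squares), which is what reg.score() computes:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])  #hypothetical regression outputs

sse = np.sum((y_true - y_pred) ** 2)         #sum of squared errors
sst = np.sum((y_true - y_true.mean()) ** 2)  #total sum of squares
r_squared = 1 - sse / sst
print(round(r_squared, 3))  #0.98
```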

In [9]:
Image('capture5.png')
Out[9]:

Outlier Detection - Train, remove the ~10% of points with the highest residual errors, then train again. This process can be repeated.
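A minimal sketch of one round of that loop, using numpy's polyfit as the regression and made-up age/net-worth data with a single injected outlier:

```python
import numpy as np

rng = np.random.RandomState(0)
ages = np.arange(20, 40, dtype=float)
net_worths = 6.25 * ages + rng.normal(scale=10.0, size=ages.shape)
net_worths[5] += 500.0  #inject one large outlier

#train, compute residual errors, drop the ~10% largest, retrain
slope, intercept = np.polyfit(ages, net_worths, 1)
residuals = np.abs((slope * ages + intercept) - net_worths)
keep = residuals.argsort()[: int(len(ages) * 0.9)]  #indices of the smallest errors
slope2, intercept2 = np.polyfit(ages[keep], net_worths[keep], 1)

#the refit slope should be much closer to the true value of 6.25
print(round(slope, 2), round(slope2, 2))
```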

Clustering - for data that is not already labeled.

K-Means Clustering - most used clustering algorithm. 2 steps to the algorithm - assign and optimize. Assign each data point to the nearest (initially random) 'center', move the 'center' to minimize the total squared distance to its points, assign again, and repeat until the cluster 'centers' cannot be optimized any further.

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')

most important parameters: n_clusters, max_iter, n_init

Limitations - output for a fixed training set will not always be the same. 'Local Minimum' can lead to odd (or unintuitive) clustering.

In [ ]:
#example k-means code

from sklearn.cluster import KMeans

import numpy as np

clf = KMeans(n_clusters = 2)

pred = clf.fit_predict(finance_features)

Feature Scaling - SVM (with an RBF kernel) and K-Means Clustering are affected by feature scaling. This process will not affect other algorithms such as decision trees or linear regression

In [6]:
Image('capture6.png')
Out[6]:
In [8]:
Image('capture7.png')
#formula for feature scaling
Out[8]:
In [10]:
#numpy and sklearn implementation of the feature scaler
from sklearn.preprocessing import MinMaxScaler
import numpy

#expects floats
weights = numpy.array([[115.], [140.], [175.]])
scaler = MinMaxScaler()

#can do fit OR transform, here is both in one line
rescaled_weight = scaler.fit_transform(weights)
rescaled_weight
Out[10]:
array([[ 0.        ],
       [ 0.41666667],
       [ 1.        ]])

Text Learning - input strings can be any length. The solution to this problem is to use a 'bag of words': a dict whose keys are the words and whose values are the counts of how many times each word appears in the string.

Word order does not matter.

Longer documents give larger counts, so otherwise-similar texts of different lengths produce different vectors.

Complex phrases ("Chicago Bulls") are more challenging to handle.

In sklearn the bag of words is called CountVectorizer

Low information words - words that carry little information (the, will, hi)

Stopwords - low-information words that occur very frequently (and, the, I, you, have); they should be removed when cleaning the data

Stemmer - reduces a group of related words to a common stem ('responsivity' to 'respons')

Tf - Term Frequency (like bag of words)

Idf - inverse document frequency - weights each word by how often it occurs across the corpus; rare words are weighted more heavily
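In sklearn the Tf-Idf representation is available as TfidfVectorizer; a small sketch with made-up strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["hi Katie the car will be late",
        "hi Sebastian the class will be great",
        "hi Katie the class will be great"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

#one row per document, one column per unique word, Tf-Idf weights as values
print(tfidf.shape)  #(3, 10)
```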

In [4]:
#sklearn implementation example

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

#string examples
string1 = "hi Katie the self driving car will be late Best Sebastian"
string2 = "Hi Sebastian the machine learning class will be great great great Best Katie"
string3 = "Hi Katie the machine learning class will be most excellent"

email_list = [string1, string2, string3]

bag_of_words = vectorizer.fit(email_list)
bag_of_words = vectorizer.transform(email_list)

print bag_of_words

#below output can be read like this: in (0, 0), the first digit is the document,
#the second digit is the word's index in the vocabulary, and the number to the right is the count
  (0, 0)	1
  (0, 1)	1
  (0, 2)	1
  (0, 4)	1
  (0, 7)	1
  (0, 8)	1
  (0, 9)	1
  (0, 13)	1
  (0, 14)	1
  (0, 15)	1
  (0, 16)	1
  (1, 0)	1
  (1, 1)	1
  (1, 3)	1
  (1, 6)	3
  (1, 7)	1
  (1, 8)	1
  (1, 10)	1
  (1, 11)	1
  (1, 13)	1
  (1, 15)	1
  (1, 16)	1
  (2, 0)	1
  (2, 3)	1
  (2, 5)	1
  (2, 7)	1
  (2, 8)	1
  (2, 10)	1
  (2, 11)	1
  (2, 12)	1
  (2, 15)	1
  (2, 16)	1
In [5]:
#how to get word location for a specific word
print vectorizer.vocabulary_.get("great")
6
In [ ]:
#Stemmer example, run in interpreter

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

stemmer.stem("responsiveness")
#output 'respons'

stemmer.stem('responsivity')
#output 'respons'

stemmer.stem('unresponsive')
#output 'unrespons'

Feature Selection

Be careful when engineering new features:

Anyone can make mistakes--be skeptical of your results!

100% accuracy should generally make you suspicious. Extraordinary claims require extraordinary proof.

If there's a feature that tracks your labels a little too closely, it's very likely a bug!

If you're sure it's not a bug, you probably don't need machine learning--you can just use that feature alone to assign labels.

Features do not equal information

There are two big univariate feature selection tools in sklearn: SelectPercentile and SelectKBest. The difference is pretty apparent by the names: SelectPercentile selects the X% of features that are most powerful (where X is a parameter) and SelectKBest selects the K features that are most powerful (where K is a parameter).
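A small sketch of SelectKBest on sklearn's built-in iris data (f_classif is the default scoring function for classification):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()

#keep only the 2 most powerful of the 4 iris features
selector = SelectKBest(f_classif, k=2)
reduced = selector.fit_transform(iris.data, iris.target)

print(iris.data.shape, '->', reduced.shape)  #(150, 4) -> (150, 2)
print(selector.get_support())  #boolean mask of which features were kept
```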

In [2]:
Image('capture8.png')
Out[2]:
In [5]:
Image('capture9.png')
Out[5]:

The sweet spot of the bias-variance dilemma is few features, large r-squared, and low SSE.

You want to balance errors against the number of features required to achieve those errors.

Regularization - an automatic process in some algorithms that optimally selects the number of features by penalizing complexity (can be used in regression).

Lasso Regression - minimizes SSE, but with an additional penalty term that discourages using more features (it can drive the coefficients of unimportant features to zero).

In [ ]:
#example lasso regression code

from sklearn.linear_model import Lasso

features, labels #get data
regression = Lasso()
regression.fit(features, labels)
regression.predict([[2, 4]])

#to see the coefficients: which features the algorithm decided were important
#(unimportant features get coefficients driven toward zero)
print regression.coef_

Principal Component Analysis - PCA

In [2]:
Image('capture11.png')
Out[2]:
In [3]:
Image('capture12.png')
Out[3]:
In [ ]:
#example of sklearn implementation of PCA

from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
pca.fit(data)

#get the information out of the PCA object
print pca.explained_variance_ratio_
first_pc = pca.components_[0]
second_pc = pca.components_[1]

#get the explained_variance_ratio for each feature
print pca.explained_variance_ratio_
In [4]:
Image('capture111.png')
Out[4]:

Validation

Splitting data into training and testing sets. This will allow you to guard against overfitting and see how well the algorithm is performing.

http://scikit-learn.org/0.17/modules/cross_validation.html

In [ ]:
from sklearn.cross_validation import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(features, 
                                                                            labels, 
                                                                            test_size=0.4, 
                                                                            random_state = 0)
In [3]:
Image('capture.1.png')
Out[3]:
In [4]:
Image('capture.2.png')
Out[4]:
In [6]:
Image('capture.3.png')

'''If our original data comes in some sort of sorted fashion, then we will want to first shuffle 
the order of the data points before splitting them up into folds, or otherwise randomly assign 
data points to each fold. If we want to do this using KFold(), then we can add the "shuffle = True" 
parameter when setting up the cross-validation object.

If we have concerns about class imbalance, then we can use the StratifiedKFold() class instead. 
Where KFold() assigns points to folds without attention to output class, StratifiedKFold() assigns 
data points to folds so that each fold has approximately the same number of data points of each 
output class. This is most useful for when we have imbalanced numbers of data points in our outcome 
classes (e.g. one is rare compared to the others). For this class as well, we can use "shuffle = True" 
to shuffle the data points' order before splitting into folds.
'''
Out[6]:
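The shuffling and stratification described above can be sketched as follows (note: newer sklearn versions moved these classes from sklearn.cross_validation to sklearn.model_selection, where the data is passed to split() rather than the constructor):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

#toy imbalanced labels: 8 points of class 0, 4 points of class 1
labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
features = np.arange(len(labels)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(features, labels):
    counts = np.bincount(labels[test_idx])
    print(counts)  #each test fold keeps the 2:1 class ratio: [4 2]
```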
In [ ]:
#GridSearchCV is a way to test multiple combos of parameter tunes, cross-validating
#as you go to find the ones that give the best performance

from sklearn import datasets, grid_search, svm
iris = datasets.load_iris()

#dict of parameters and the possible values they could have
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
#what kind of algorithm to use
svr = svm.SVC()
#create classifier, pass algorithm and dict of parameters
clf = grid_search.GridSearchCV(svr, parameters)
#fit function tries all parameter combos and returns a fitted classifier thats automatically
#tuned to the optimal parameter combination
#access parameter values via clf.best_params_
clf.fit(iris.data, iris.target)

Evaluation Metrics

In [8]:
Image('capture.4.png')
Out[8]:
In [10]:
Image('capture.5.png')
Out[10]:

Recall - nearly every time a POI shows up in my test set, I am able to identify him/her. The tradeoff is false positives, where non-POIs sometimes get flagged.

Precision - whenever a person gets flagged, you know with high confidence it is a real POI. The tradeoff is that you sometimes miss real POIs, since you are reluctant to pull the trigger on edge cases.

Precision is the probability that a (randomly selected) retrieved document is relevant.
Recall is the probability that a (randomly selected) relevant document is retrieved in a search.
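Both metrics are available in sklearn; a sketch with made-up POI labels (1 = POI):

```python
from sklearn.metrics import precision_score, recall_score

true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

#3 of the 5 people flagged are real POIs -> precision 0.6
print(precision_score(true_labels, predictions))  #0.6
#3 of the 4 real POIs were flagged -> recall 0.75
print(recall_score(true_labels, predictions))     #0.75
```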

Conclusion

4 main topics:

  1. Dataset and question - can you find the data and define a question?
  2. Feature selection
  3. Algorithm choice and parameter tunes
  4. Validation/evaluation
In [11]:
Image('capture.6.png')
Out[11]: